RED WINE EDA BY AMONAH ALI

The red wine dataset is analysed to find the variables that most affect wine quality.

Univariate Plots Section

## [1] 1599   13
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Quality values range from 3 to 8, with a mean of 5.636 and a median of 6.

Next, each of the other variables is examined to see whether it affects quality.
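The per-variable summaries that follow can be produced in one call. The sketch below is a minimal illustration, not the author's original code: it rebuilds only the first six rows shown in the `head()` output above (four columns, chosen for brevity), so the numbers it prints differ from the full-dataset summaries.

```r
# First six rows of four columns, copied from the head() output above.
# This is an illustrative sample, not the full 1599-row dataset.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# summary() returns six statistics per numeric column;
# sapply() collects them into a matrix with one column per variable.
var_summaries <- sapply(wine, summary)
print(round(var_summaries, 3))
```

With the full data frame loaded, the same `sapply(wine, summary)` call would cover every numeric column at once.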

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

volatile.acidity values range from 0.12 to 1.58, with a mean of 0.5278 and a median of 0.52.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH values range from 2.74 to 4.01, with a mean of 3.311 and a median of 3.310.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

residual.sugar values range from 0.9 to 15.5, with a mean of 2.539 and a median of 2.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

density values range from 0.9901 to 1.0037, with a mean of 0.9967 and a median of 0.9968.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

fixed.acidity values range from 4.6 to 15.9, with a mean of 8.32 and a median of 7.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

free.sulfur.dioxide values range from 1 to 72, with a mean of 15.87 and a median of 14.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

sulphates values range from 0.33 to 2.0, with a mean of 0.6581 and a median of 0.62.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

alcohol values range from 8.4 to 14.9, with a mean of 10.42 and a median of 10.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

total.sulfur.dioxide values range from 6 to 289, with a mean of 46.47 and a median of 38.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

chlorides values range from 0.012 to 0.611, with a mean of 0.08747 and a median of 0.079.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

citric.acid values range from 0 to 1, with a mean of 0.271 and a median of 0.26.
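One quick way to read skew from these summaries is to compare each mean with its median. The sketch below uses the means and medians printed above for a handful of variables. Treating mean greater than median as a sign of right skew is only a rough rule of thumb, and the column name `right_skew_hint` is made up for this illustration.

```r
# Means and medians copied from the summary output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide",
               "sulphates", "alcohol", "density"),
  mean     = c(2.539, 0.08747, 46.47, 0.6581, 10.42, 0.9967),
  median   = c(2.200, 0.07900, 38.00, 0.6200, 10.20, 0.9968)
)

# A mean above the median hints at a longer right tail (heuristic only).
stats$gap <- stats$mean - stats$median
stats$right_skew_hint <- stats$gap > 0
print(stats)
```

A histogram of each variable remains the more reliable check; this comparison only flags candidates.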

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 wine samples and 12 variables, not counting the row index X. The main variable is quality, which ranges from 3 to 8, although most samples are rated 5 or 6. Most of the plotted distributions are right-skewed.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality, because wine is categorized for consumers by its quality level (high, average, or low), and the other variables influence that quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Several features will help in the investigation, such as alcohol, citric acid, volatile acidity, and sulphates.

Did you create any new variables from existing variables in the dataset?

A wine level variable (quality.lvl) was created to capture the level of wine quality. It has three levels: high, average, and low.

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ quality.lvl         : chr  "average" "average" "average" "average" ...
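A structure like the one above, with `quality` as an ordered factor and `quality.lvl` as a character column, could be produced along these lines. This is a sketch under assumptions: the helper name `quality_level` and the cutoffs (4 or below is low, 5 to 6 is average, 7 or above is high) are not given in the report. The only thing the output above confirms is that scores 5 and 6 map to "average".

```r
# Hypothetical helper: map a numeric quality score to a level.
# The cutoffs are an assumption, not taken from the report.
quality_level <- function(q) {
  ifelse(q <= 4, "low", ifelse(q <= 6, "average", "high"))
}

# Quality scores of the first six rows, from the head() output.
wine <- data.frame(quality = c(5, 5, 5, 6, 5, 5))

# Derive the level first, while quality is still numeric.
wine$quality.lvl <- quality_level(wine$quality)

# Then store quality as an ordered factor over the observed range 3..8.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine)
print(quality_level(3:8))
```

Deriving the level before converting `quality` to a factor keeps the numeric comparison simple.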

The plot displays the quality levels and shows that most wines fall in the average level.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data was already tidy. The only changes were adding a new column, the wine level, for the quality of the wine, and deleting the X variable because it is not useful in the dataset.
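A minimal sketch of these two housekeeping steps, assuming a data frame named `wine` (here rebuilt from three columns of the first six rows of the `head()` output): count missing values per column, then drop the row index.

```r
# Three columns of the first six rows, copied from the head() output.
wine <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Count missing values per column as a basic tidiness check.
missing_per_column <- colSums(is.na(wine))
print(missing_per_column)

# X only repeats the row number, so remove it.
wine$X <- NULL
print(names(wine))
```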

Bivariate Plots Section

The first factor examined is alcohol, followed by volatile acidity, sulphates, and citric acid. The correlations between factors are examined as well.

The plots above show that higher alcohol goes with better wine quality. The relationships between quality and the other variables must also be examined, because alcohol is not the only factor that affects quality.
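The alcohol and quality relationship can also be checked numerically by averaging alcohol within each quality score. The sketch below uses only the first six rows from the `head()` output, so it covers just scores 5 and 6 and is not evidence for the full-dataset trend.

```r
# Alcohol and quality for the first six rows, from the head() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Mean alcohol for each quality score; names of the result are the scores.
mean_alcohol <- tapply(wine$alcohol, wine$quality, mean)
print(mean_alcohol)
```

On the full dataset the same call would return one mean per score from 3 to 8, which makes the trend across quality easy to read.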

Quality increases as sulphates increase, but at the highest sulphate levels the effect on quality turns negative.

The plot shows that volatile.acidity decreases as quality increases, so the relationship is inverse.

Quality increases as citric acid increases.

There is a weak correlation between alcohol and sulphates.

There is no correlation between alcohol and citric acid.

There is a strong positive correlation between total sulfur dioxide and free sulfur dioxide.

There is a strong positive correlation between fixed acidity and citric acid.

Sulphates and chlorides form a cluster.
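Pairwise correlations like these can be computed with `cor()`. The sketch below is illustrative only: it uses the first six rows from the `head()` output, which is far too small a sample to reproduce the reported strengths, and it assumes Pearson correlation since the report does not name a method.

```r
# Four columns of the first six rows, copied from the head() output.
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Pearson correlation matrix (method is an assumption).
cor_matrix <- cor(wine, method = "pearson")
print(round(cor_matrix, 2))
```

Each off-diagonal cell is the correlation between the row and column variables; the diagonal is always 1.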

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. Alcohol is the factor that affects wine quality the most.
  2. A combination of alcohol and other factors affects quality more than alcohol alone.
  3. Sulphates, citric acid, and alcohol are positively correlated with quality.
  4. Volatile acidity is negatively correlated with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  1. Chlorides and sulphates have an interesting relationship.
  2. There is a strong positive correlation between citric acid and fixed acidity.

What was the strongest relationship you found?

The strongest relationship found is between alcohol and quality.

Multivariate Plots Section

This section looks at the relationships between factors, with quality shown as color.
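A scatter plot with quality mapped to color might be drawn as follows with base graphics. This is a sketch, not the original plotting code: the color choices and the name `palette_by_quality` are assumptions, and the data are the first six rows from the `head()` output, which contain only quality scores 5 and 6.

```r
# Sulphates, alcohol, and quality for the first six rows, from the head() output.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality   = c(5, 5, 5, 6, 5, 5)
)

# Hypothetical palette: one color per quality score present in the sample.
palette_by_quality <- c("5" = "steelblue", "6" = "darkorange")

# Look up each point's color by its quality score.
point_cols <- palette_by_quality[as.character(wine$quality)]

plot(wine$sulphates, wine$alcohol,
     col = point_cols, pch = 19,
     xlab = "sulphates", ylab = "alcohol")
legend("bottomright", legend = names(palette_by_quality),
       col = palette_by_quality, pch = 19, title = "quality")
```

For the full dataset the palette would need one entry for each score from 3 to 8.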

Wines with higher sulphates tend to have higher alcohol.

Density does not appear to affect alcohol.

There is an inverse relationship between volatile.acidity and alcohol: as alcohol increases, volatile.acidity decreases.

Alcohol increases as citric acid increases.

pH and alcohol show the same inverse pattern as volatile.acidity and alcohol: alcohol increases as pH decreases.

Lower total.sulfur.dioxide goes with higher alcohol.

There is no correlation between fixed.acidity and volatile.acidity.

Lower volatile.acidity combined with higher citric.acid goes with higher quality.

In this plot, fixed.acidity and citric.acid show no clear pattern with quality.

A particular range of sulphates and chlorides goes with high-quality wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. The combination of chlorides and sulphates affects wine quality positively.
  2. Alcohol, volatile acidity, and sulphates together help improve wine quality.
  3. The combination of citric acid and alcohol also affects wine quality positively.

Were there any interesting or surprising interactions between features?

The interesting interaction is between chlorides and sulphates.


Final Plots and Summary

Plot One

Description One

The first plot shows the quality of the wine, which is the main feature of this dataset. It shows high counts for quality scores 5, 6, and 7 and low counts for 3 and 4.

Plot Two

Description Two

This plot shows that both sulphates and alcohol are high when wine quality is high. Together, alcohol and sulphates affect quality positively and go with high-quality wine.

Plot Three

Description Three

This plot shows the relationship between quality and alcohol: alcohol affects quality positively, and mean alcohol increases with quality.

Reflection

The red wine dataset contains 1599 samples and 12 variables. The analysis started with the main feature of the dataset, quality, because a wine is nothing without its quality, and then examined the other factors individually. The next most important factor affecting quality is alcohol; citric acid, volatile acidity, and sulphates also affect it. Strong positive correlations were found between total and free sulfur dioxide and between fixed acidity and citric acid, along with an unusual relationship between sulphates and chlorides, which form a cluster.

The multivariate plots examined the relationships between the other factors with quality shown as color. They led to the conclusion that alcohol is the main factor affecting quality, with citric acid, volatile acidity, and sulphates as contributing factors.

The limitation of this dataset is that most quality scores fall between 5 and 6. In addition, no quality level is provided, which is why one was created. In future work, I hope to analyse a dataset similar to the red wine dataset that includes a classification or level for wine quality.